Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Map and Arrays of Maps in BQ for StorageWrites for Beam Rows #22179

Conversation

prodriguezdefino
Copy link
Contributor

@prodriguezdefino prodriguezdefino commented Jul 7, 2022

Currently BigQuery table schema utility, and the implementation for StorageWrites for Beam Rows, does not support sending records with Maps as part of their schema.

This PR adds that functionality transforming the Map into a Message type which contains two fields key and value respecting the types coming from upstream while mimicking the behavior when using TableRows to the BigQueryIO PTransform.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@prodriguezdefino
Copy link
Contributor Author

Run Java PreCommit

@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 8, 2022
@prodriguezdefino
Copy link
Contributor Author

fixes #23618

@prodriguezdefino prodriguezdefino marked this pull request as ready for review October 12, 2022 22:25
@prodriguezdefino prodriguezdefino changed the title Support Map and Arrays of Maps in BQ for StorageWrites Support Map and Arrays of Maps in BQ for StorageWrites for Beam Rows Oct 31, 2022
@prodriguezdefino
Copy link
Contributor Author

prodriguezdefino commented Nov 21, 2022

.remove-labels stale

@reuvenlax
Copy link
Contributor

Can you rebase and resolve, since that file has since been refactored?

@prodriguezdefino
Copy link
Contributor Author

Can you rebase and resolve, since that file has since been refactored?

done

@@ -229,6 +243,8 @@ private static Object messageValueFromRowValue(
if (value == null) {
if (fieldDescriptor.isOptional()) {
return null;
} else if (fieldDescriptor.isRepeated()) {
return Lists.newArrayList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.emptyList()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

arrayElementType.getTypeName().isCollectionType()
|| arrayElementType.getTypeName().isMapType()
? true
: false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove redundant ternary operator. Also no need for Boolean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (what was I thinking)


if (shouldFlatMap) {
valueStream = valueStream.flatMap(vs -> ((List) vs).stream());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain why flatMap is correct here?

Copy link
Contributor Author

@prodriguezdefino prodriguezdefino Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because BQ does not support arrays of maps, this code is making the decision of flattening those structures (the if condition is computed based on that particular scenario).

[
map1 [k1,v2] [k2,v2] ,
map2 [k3,v3],
map3 [k4,v4] [k5,v5],
]

------------------------- to
[
record {key:k1, value:v1},
record {key:k2, value:v2},
record {key:k3, value:v3},
record {key:k4, value:v4},
record {key:k5, value:v5}
]

It respects the order in the array and the inherent order of iteration in the maps, but it won't check for repeated keys across the maps in the original array.

FieldType keyFieldType,
FieldType valueFieldType,
Map.Entry<Object, Object> entryValue) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

FieldDescriptor keyFieldDescriptor =
Preconditions.checkNotNull(descriptor.findFieldByName("key"));
@Nullable Object key = toProtoValue(keyFieldDescriptor, keyFieldType, entryValue.getKey());
if (key != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are null keys allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, AFAICT the backing Map in the Row object is a HashMap (with expected size) so I would say that null keys are allowed in a map property for a Row object.
This code only ignores setting the value in the proto if null was the value present on the map's key.

@prodriguezdefino
Copy link
Contributor Author

Run Java PreCommit

@prodriguezdefino
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@prodriguezdefino
Copy link
Contributor Author

Is there a way to rerun Java Tests / Java Unit Tests (ubuntu-latest) (pull_request) those are failing because:

org.apache.beam.fn.harness.jmh.ProcessBundleBenchmarkTest > testStateWithoutCaching FAILED
[176](https://github.com/apache/beam/actions/runs/3595444602/jobs/6054940468#step:5:177)
    java.util.concurrent.ExecutionException at ProcessBundleBenchmarkTest.java:47
[177](https://github.com/apache/beam/actions/runs/3595444602/jobs/6054940468#step:5:178)
        Caused by: java.lang.RuntimeException at ProcessBundleBenchmark.java:175
[178](https://github.com/apache/beam/actions/runs/3595444602/jobs/6054940468#step:5:179)
            Caused by: java.lang.RuntimeException at UnboundedScheduledExecutorService.java:316
[179](https://github.com/apache/beam/actions/runs/3595444602/jobs/6054940468#step:5:180)
                Caused by: java.lang.InterruptedException at FutureTask.java:418

Not related with this changes, but not sure how to ensure this works without pushing more commits.

@reuvenlax
Copy link
Contributor

Has this been tested e2e (with BigQuery)?

@github-actions github-actions bot removed the stale label Jan 27, 2023
@prodriguezdefino
Copy link
Contributor Author

Has this been tested e2e (with BigQuery)?

yes.

map, job id: 2023-01-30_16_31_29-9471872084021240643
Screenshot 2023-01-30 at 4 47 24 PM

list/array of maps job id: 2023-01-30_16_32_57-10318556278839602747
schema on BQ:
Screenshot 2023-01-30 at 4 47 44 PM

@reuvenlax
Copy link
Contributor

Run Java_GCP_IO_Direct PreCommit

1 similar comment
@prodriguezdefino
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@github-actions
Copy link
Contributor

github-actions bot commented Apr 1, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added stale and removed stale labels Apr 1, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 8, 2023
@github-actions
Copy link
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jun 15, 2023
@JohnZZGithub
Copy link
Contributor

Could we merge the PR? This is necessary for us to ingest data to BigQuery.

@prodriguezdefino
Copy link
Contributor Author

prodriguezdefino commented Sep 20, 2024 via email

@JohnZZGithub
Copy link
Contributor

This PR has been closed for a while, I can take some time to migrate the changes to a new branch from master.

On Thu, Sep 19, 2024 at 6:03 PM zz @.> wrote: Could we merge the PR? This is necessary for us to ingest data to BigQuery. — Reply to this email directly, view it on GitHub <#22179 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2HMFYXRBLP2I6TEUYYPJTZXNX67AVCNFSM6AAAAABORALTZWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRSGUYDSMJTGQ . You are receiving this because you authored the thread.Message ID: @.>

Thanks a lot Pablo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants